Overview

Two studies were conducted:

  1. TT - Transfusion Trigger (= TRIBE)
  2. PCR SEPSIS

Question addressed here:

Can we use this body of patient information to predict infection?

This work is an exploratory “proof of concept” to see whether there is something here worth further investigation.

Notes


TT Study EDA

Question 0: Describe who was screened for blood infection and when

There are three variables indicating time of blood-level screening in the data, in addition to the vital screening variable (V_TIME_PERFORMED). They refer to blood CBC (V_TIME_PERFORMED_1), blood chemistry (V_TIME_PERFORMED_2), and blood gas (V_TIME_PERFORMED_3) respectively, according to the variable descriptions.

Almost all entries have vitals (V_TIME_PERFORMED, 33 NA), and most have blood CBC (V_TIME_PERFORMED_1, 2186 NA) and blood chemistry (V_TIME_PERFORMED_2, 3542 NA). Only about 39% of the 14852 entries have blood gas (V_TIME_PERFORMED_3, 9111 NA).

Only two entries have a blood infection reported when neither blood CBC nor blood chemistry was performed. In the other 120 entries with a blood infection, at least one of these tests was performed, so blood work seems to be a precursor for detection.
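The counts above can be reproduced with a few lines of base R. The following is a minimal sketch on toy data, not the code used for this report: the V_TIME_PERFORMED* names come from the data, while the `Blood` flag column and all values are assumptions made for illustration.

```r
# Toy data: column names mirror the report, values are made up.
# `Blood` is a hypothetical name for the blood infection flag.
toy <- data.frame(
  V_TIME_PERFORMED   = c("08:00", "09:00", NA, "07:30", "10:15"),
  V_TIME_PERFORMED_1 = c("08:05", NA, NA, "07:40", "10:20"),
  V_TIME_PERFORMED_2 = c("08:10", NA, NA, NA, "10:25"),
  V_TIME_PERFORMED_3 = c(NA, NA, NA, "07:50", NA),
  Blood              = c("No", "Yes", "No", "No", "Yes"),
  stringsAsFactors   = FALSE
)

time_vars <- c("V_TIME_PERFORMED", "V_TIME_PERFORMED_1",
               "V_TIME_PERFORMED_2", "V_TIME_PERFORMED_3")

# Number of missing screening times per variable
na_counts <- colSums(is.na(toy[, time_vars]))

# Rows where neither CBC nor chemistry was performed
no_blood_work <- is.na(toy$V_TIME_PERFORMED_1) & is.na(toy$V_TIME_PERFORMED_2)

# Cross-tabulate blood infection against the absence of blood work
blood_by_work <- table(blood_infection = toy$Blood, no_blood_work = no_blood_work)
print(na_counts)
print(blood_by_work)
```

The cell for blood infection "Yes" with no blood work "TRUE" corresponds to the two unusual entries described above.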

Question 1: Describe the type, frequency and patterns of infections in study participants

Observations:

#Correlation of having such an infection
round(cor(TT_per[,c("n_blood","n_urine","n_wound", "n_pneumonia", "n_any")] > 0, use = "pairwise.complete.obs"), 2)
##             n_blood n_urine n_wound n_pneumonia n_any
## n_blood        1.00    0.30    0.21        0.29  0.55
## n_urine        0.30    1.00    0.18        0.25  0.44
## n_wound        0.21    0.18    1.00        0.17  0.39
## n_pneumonia    0.29    0.25    0.17        1.00  0.68
## n_any          0.55    0.44    0.39        0.68  1.00
#Correlation of number of such infections
round(cor(TT_per[,c("n_blood","n_urine","n_wound", "n_pneumonia", "n_any")], use = "pairwise.complete.obs"), 2)
##             n_blood n_urine n_wound n_pneumonia n_any
## n_blood        1.00    0.42    0.45        0.43  0.72
## n_urine        0.42    1.00    0.39        0.52  0.69
## n_wound        0.45    0.39    1.00        0.39  0.70
## n_pneumonia    0.43    0.52    0.39        1.00  0.86
## n_any          0.72    0.69    0.70        0.86  1.00

The next five plots show all individuals in the study, grouped by their study outcome. From the day of their first collection, their No/Yes/NA infection status is tracked. Within each outcome group, individuals are sorted by their number of days observed.

Question 2: Describe the distribution of patient burn severity and connection to outcomes

Observations:

Question 3: Individual vital signs over time

Observations

When does the first infection occur?

#Days from first collection to first infection
table(TT_per$first_any)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18 
##   5  10  10  11   7  11   9  12   4   8   6   6   6   6   6   2   6   4 
##  19  20  22  23  24  25  26  29  30  31  33  34  43  46  52  85 118 
##   2   1   2   1   1   2   3   1   2   2   1   1   4   1   2   1   1
#Days from first collection to first blood infection
table(TT_per$first_blood)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  15  16  17  19  20  21 
##   2   5   1   1   3   2   5   1   5   2   2   3   4   1   3   1   1   2 
##  24  25  26  27  30  33  38  39  40  43  50  52  56  70  73  85  86 117 
##   1   1   3   1   2   1   2   2   1   2   1   1   1   1   1   1   1   1 
## 118 363 
##   1   1
days_from_admit_to_collection = difftime(TT_per$first_collection_date, TT_per$Admit_date, units = "days")
#Days from admit to first collection
table(days_from_admit_to_collection)
## days_from_admit_to_collection
##   0   1   2   3   4 
## 225 104   9   6   2
#Days from admit to first infection
table(TT_per$first_any + days_from_admit_to_collection)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18 
##   3   5  10   9  12  14   8   9   9   6   8   4   7   7   3   4   6   4 
##  19  20  21  22  23  24  26  27  30  31  33  38  43  44  46  52  54  85 
##   1   1   1   1   2   2   4   1   2   3   1   1   3   1   1   1   1   1 
## 118 
##   1
#Days from admit to first blood infection
table(TT_per$first_blood + days_from_admit_to_collection)
## 
##   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18  19  20 
##   3   4   2   1   4   3   3   4   2   1   4   1   1   3   3   1   1   1 
##  21  24  26  27  30  31  33  38  39  40  41  43  50  52  56  70  76  85 
##   2   1   4   1   1   1   1   1   1   2   1   2   1   1   1   1   1   1 
##  86 117 118 363 
##   1   1   1   1

Correlation of vital signs

The charts below are for data in aggregate. Individual-level data may be more telling.

Track many patient vitals over time for an individual patient.

The red vertical line is the first blood infection and the blue is any infection.

Track an individual vital over time for many patients.

The red vertical line is the first blood infection and the blue is any infection.

Restrict to individuals with at least one blood infection

Question 4: (pre-modeling) Which individual vital signs over time may be predictive?

Compare infection days vs. day-before infection vs. neither for individuals who have at least one infection, NOT restricted to first infection

Distributions of vital statistics by infection status excluding outliers.

Based on the figure below, we’d suspect that some variables may be correlated with infection or pre-infection, including heart rate, temperature, platelet count, and sodium.

There are a few hundred NA entries (out of about 15,000) in “onset_tomorrow”, most due to not having next-day readings for an individual, which means that the data is not missing at random. Some NA data is also due to NA entries in the onset variable.
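To make the two sources of NA concrete, here is a minimal sketch of how a next-day indicator such as “onset_tomorrow” could be derived. It is an assumption about the construction, not the report's actual code; `id`, `day`, and `onset` are hypothetical column names and the data are made up.

```r
# Toy per-day records: `id`, `day`, and `onset` are hypothetical names.
toy <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 2, 4, 1, 2, 3),
  onset = c("No", "No", "Yes", "No", NA, "Yes"),
  stringsAsFactors = FALSE
)

# Look up the onset value recorded for the same individual on day + 1.
key_today    <- paste(toy$id, toy$day)
key_tomorrow <- paste(toy$id, toy$day + 1)
toy$onset_tomorrow <- toy$onset[match(key_tomorrow, key_today)]

# NA arises either because there is no next-day row (match() returns NA)
# or because the next-day onset value is itself NA.
no_next_day <- is.na(match(key_tomorrow, key_today))
print(toy)
print(table(no_next_day, is.na(toy$onset_tomorrow)))
```

In this toy example, individual 1 has no reading on day 3, so day 2 gets NA for lack of a next-day row, while individual 2 gets NA on day 1 because the day 2 onset value is missing.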

Now RESTRICT to first infection

Distributions of vital statistics by infection status excluding outliers.

Now look at the period before infection (1, 2, or 3 days) instead of only the day before, comparing it with the first infection day and non-infection days, still removing days after the first infection (note the color change).
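The grouping described here can be sketched as follows. This is an illustration under stated assumptions rather than the report's code: the per-day data frame, the `day` column, the group labels, and the values are all hypothetical; `first_any` is taken to be the day of first infection counted from first collection.

```r
# Toy data for one individual: `day` counts days from first collection.
toy <- data.frame(id = 1, day = 0:8)
first_any <- 5   # assumed day of first infection for this individual

days_to_first <- first_any - toy$day

# Label each day; days after the first infection get NA.
toy$status <- ifelse(days_to_first == 0, "Infection",
              ifelse(days_to_first >= 1 & days_to_first <= 3, "Pre-infection",
              ifelse(days_to_first > 3, "Neither", NA)))

# Remove days after the first infection
toy <- toy[!is.na(toy$status), ]
print(table(toy$status))
```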

Now for BLOOD INFECTION ONLY, compare infection days vs. day-before infection vs. neither for individuals who have at least one infection, NOT restricted to first infection

Distributions of vital statistics by infection status excluding outliers.

Investigation of ICU days and PDR (predicted death rate) shows that those with current or imminent infection are generally in worse health than the other groups. This matters because factors like elevated heart rate may be due to general poor health rather than impending infection.

Next we look at the observed patterns more formally with a multinomial model where the possible outcomes are as labelled in the figure: current infection, day before infection onset, and neither of those cases. The NA case is excluded. I used a penalized version of multinomial regression from glmnet, with cross-validation to select variables. The tables below show the fitted non-zero coefficients using a less restrictive and a more restrictive penalty term. The input data were standardized before fitting the model so that the coefficient magnitudes would be comparable, though they lose interpretability as a result.
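For readers unfamiliar with this kind of fit, the following is a minimal sketch on simulated data, assuming the glmnet package is installed. It is not the code behind the tables below: the predictor values and the outcome are synthetic, only three predictors are used, and the choice of `lambda.min` and `lambda.1se * 0.8` as the two penalties is an assumption loosely based on the labels in the output.

```r
library(glmnet)

set.seed(1)
n <- 300

# Simulated predictors: names borrowed from the report, values are synthetic.
x <- cbind(V_TEMPERATURE = rnorm(n),
           V_HEART_RATE  = rnorm(n),
           V_SODIUM      = rnorm(n))

# Three-class outcome loosely driven by the first predictor
score <- x[, "V_TEMPERATURE"] + rnorm(n)
y <- cut(score, breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = c("Neither", "Day Before Infection", "Current Infection"))

# Standardize up front so coefficient magnitudes are comparable
x_std <- scale(x)

# Cross-validated, lasso-penalized multinomial regression
cv_fit <- cv.glmnet(x_std, y, family = "multinomial", standardize = FALSE)

# Coefficients at a smaller and a larger penalty (one sparse matrix per class)
coef_small <- coef(cv_fit, s = "lambda.min")
coef_large <- coef(cv_fit, s = cv_fit$lambda.1se * 0.8)
print(coef_small)
```

A larger penalty shrinks more coefficients to exactly zero, which is why the second table in each pair below is shorter.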

Decision Trees

Multinomial Models:

## [1] "1. Model with all data"
## [1] "Coefficients -- smaller penalty"
##                       Current infection Day Before Infection Neither
## (Intercept)                     -8.0550              -0.6922  8.7472
## V_FIO2                           0.0205               0.0252 -0.0457
## V_HEART_RATE                     0.0519               0.0401 -0.0919
## V_GLUCOSE                        0.0525               0.2889 -0.3414
## V_BLOOD_UREA_NITROGEN            0.0616               0.0111 -0.0727
## V_WHITE_BC                       0.0704               0.1645 -0.2349
## V_RESPIRATORY_RATE               0.1183               0.0194 -0.1378
## V_MODS_SCORE                     0.2207               0.0299 -0.2506
## V_SODIUM                         0.6340               0.5607 -1.1948
## V_TEMPERATURE                    6.0934              -1.3635 -4.7299
## [1] "Coefficients - larger penalty (lambda.1se*.8)"
##               Current Infection Day Before Infection Neither
## (Intercept)             -0.9785              -1.0297  2.0082
## V_GLUCOSE                0.0066               0.0147 -0.0213
## V_WHITE_BC               0.0259               0.0430 -0.0689
## V_MODS_SCORE             0.0622               0.0249 -0.0871
## V_TEMPERATURE            0.1207              -0.0047 -0.1160
## [1] "2. Model with data to first infection and infection days"
## [1] "Coefficients -- smaller penalty"
##                         Current Infection Day Before Infection Neither
## (Intercept)                       -3.4271              -1.8108  5.2379
## V_HEMOGLOBIN                      -1.3027               0.3129  0.9898
## V_PAO2                            -0.4164              -0.1168  0.5333
## V_MEANARTERIAL_PRESSURE           -0.4117               0.1465  0.2652
## V_PB_SYSTOLIC                     -0.2600               0.1547  0.1053
## V_CREATININE                      -0.2324              -0.1867  0.4190
## V_GLUCOSE                          0.0006               0.0022 -0.0028
## V_FIO2                             0.0168               0.0003 -0.0171
## V_MODS_SCORE                       0.0525               0.0416 -0.0941
## V_PLATELET_COUNT                   0.0879              -0.0291 -0.0588
## V_PACO2                            0.1347               0.0171 -0.1518
## V_RESPIRATORY_RATE                 0.2895               0.0348 -0.3242
## V_WHITE_BC                         0.3048              -0.0089 -0.2959
## V_POTASSIUM                        0.4361               0.0519 -0.4880
## V_BLOOD_UREA_NITROGEN              0.4417              -0.1387 -0.3030
## V_HEART_RATE                       0.4529               0.1509 -0.6038
## V_GLASCOWCOMA_SCALE                0.8093              -0.6870 -0.1223
## V_SODIUM                           1.0240               1.4221 -2.4461
## V_TEMPERATURE                      1.9304              -0.6288 -1.3016
## [1] "Coefficients - larger penalty (lambda.1se*.5)"
##                         Current Infection Day Before Infection Neither
## (Intercept)                       -2.3480              -1.7468  4.0947
## V_HEMOGLOBIN                      -1.0116               0.2084  0.8032
## V_PAO2                            -0.3908              -0.1064  0.4972
## V_MEANARTERIAL_PRESSURE           -0.2955               0.0970  0.1985
## V_CREATININE                      -0.1283              -0.1007  0.2290
## V_PB_SYSTOLIC                     -0.0728               0.0341  0.0386
## V_PACO2                            0.0445               0.0068 -0.0513
## V_PLATELET_COUNT                   0.0678              -0.0187 -0.0491
## V_WHITE_BC                         0.2374              -0.0031 -0.2342
## V_RESPIRATORY_RATE                 0.2387               0.0210 -0.2597
## V_POTASSIUM                        0.2761               0.0286 -0.3047
## V_HEART_RATE                       0.2981               0.0793 -0.3774
## V_BLOOD_UREA_NITROGEN              0.3128              -0.0855 -0.2272
## V_GLASCOWCOMA_SCALE                0.4977              -0.3865 -0.1112
## V_SODIUM                           0.9180               1.0231 -1.9411
## V_TEMPERATURE                      1.3008              -0.3017 -0.9991
## [1] "3. Model with data to first infection"
## [1] "Coefficients -- smaller penalty (lambda.min*.5)"
##                           Day Before Infection      Neither
## (Intercept)                       -1.579252814  1.579252814
## V_PACO2                           -0.161176042  0.161176042
## V_HCO3                            -0.020185000  0.020185000
## V_POTASSIUM                        0.004014613 -0.004014613
## V_FIO2                             0.046985008 -0.046985008
## V_CENTRAL_VENOUS_PRESSURE          0.080083762 -0.080083762
## V_GLUCOSE                          0.097333418 -0.097333418
## V_SODIUM                           0.339327711 -0.339327711
## [1] "Coefficients - larger penalty"
## Day Before Infection              Neither 
##            -1.230005             1.230005

Not surprisingly, temperature has by far the largest coefficient values. Second is the MODS score (Multiple Organ Dysfunction Score); I’m not sure what that means or whether it makes sense in context. As expected by the study authors, respiratory rate, white blood cell count, and heart rate are somewhat correlated with the outcomes. In the more limited coefficient set, only the Glasgow Coma Scale is a negative predictor, i.e., the higher the score, the less likely an infection outcome. Also, all coefficients are stronger for current infection than for the day before infection, with the opposite sign for no infection. The same is not true of the larger set.

Another notable outcome is that the two variables included to control for severity of condition are included in the larger model, but with moderate coefficient values, and they are absent from the smaller model.

PCR Study EDA

The figure compares the distribution of sepsis days per patient in the PCR study to the distribution of “onset” days per patient in the TT study. (According to the project readme, the variable “SEPSIS_STATUS” in the PCR data indicates whether the patient was determined to have a new onset of sepsis at that time point.) It’s similar overall, but with more patients in the PCR study having very high numbers of sepsis days.

I will simplify this section by only comparing sepsis days to non-sepsis days, and I will see if I get a similar list of variables associated with infection as in the TT study.

Distribution of vitals for sepsis vs. non-sepsis patients evaluated daily, excluding outliers

The figure compares the distribution of vital statistics for sepsis and non-sepsis days in patient history, after removing “outliers”, i.e., the bottom and top two percent of each distribution. White blood cell count seems to behave similarly to the TT study, but heart rate and platelet count seem to have the opposite association. For example, average heart rate seems to be lower for the sepsis group.
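The trimming rule can be sketched in base R as below. This is a minimal illustration, assuming the cutoffs are the 2nd and 98th percentiles of each variable taken over all days; whether the original plots trimmed within groups or overall is not stated. The helper name `trim_outliers` and the toy vector are hypothetical.

```r
# Keep only values between the 2nd and 98th percentiles (inclusive).
# `trim_outliers` is a hypothetical helper, not a function from the report.
trim_outliers <- function(v, lower = 0.02, upper = 0.98) {
  bounds <- quantile(v, probs = c(lower, upper), na.rm = TRUE)
  v[!is.na(v) & v >= bounds[1] & v <= bounds[2]]
}

heart_rate <- 1:100            # toy values, not study data
trimmed <- trim_outliers(heart_rate)
print(range(trimmed))
print(length(trimmed))
```

Because the cutoffs depend on the data, a variable with many tied values may lose fewer than four percent of its observations.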

Below are the fitted coefficients from only the less restrictive cross-validated multinomial model. The more restrictive model is left out since this model is already fairly small.

Note that although temperature did not appear to be significant based on the distribution plots, it is again the most important term in the model. As above, coefficients are shown on a standardized scale.

Data oddities

  • A number of entries with infection Onset = “Yes” have infection Any = “No”. How is that possible?
sum(TT$Onset == "Yes" & TT$Any == "No", na.rm = TRUE)
## [1] 47
  • How should we treat the cases where infection presents in the first couple days?
# PCR STUDY
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  101   11    8    1    4    5    3    7    8   12    8    6    2    2    5 
##   16   17   18   19   20   21   22   23   24   25   27   28   30   34   35 
##    2    3    2    3    2    3    1    1    1    1    1    2    1    1    2 
##   36   37   40   43   48   52 <NA> 
##    1    1    2    1    1    1    0
# TT STUDY
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##    5   10   10   11    7   11    9   12    4    8    6    6    6    6    6 
##   16   17   18   19   20   22   23   24   25   26   29   30   31   33   34 
##    2    6    4    2    1    2    1    1    2    3    1    2    2    1    1 
##   43   46   52   85  118 <NA> 
##    4    1    2    1    1  189
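For the first oddity in the list above (Onset = “Yes” with Any = “No”), a full cross-tabulation that keeps NA as its own level may help show where the inconsistent entries sit. The sketch below uses made-up values; only the `Onset` and `Any` column names come from the data, and `TT_toy` is a hypothetical stand-in for `TT`.

```r
# Toy stand-in for TT: values are made up for illustration.
TT_toy <- data.frame(
  Onset = c("Yes", "Yes", "No", "No", NA, "Yes"),
  Any   = c("Yes", "No",  "No", "Yes", "No", NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two flags, keeping NA as a separate level
onset_by_any <- table(Onset = TT_toy$Onset, Any = TT_toy$Any, useNA = "ifany")

# Same count as the one-line check, for comparison
n_inconsistent <- sum(TT_toy$Onset == "Yes" & TT_toy$Any == "No", na.rm = TRUE)
print(onset_by_any)
print(n_inconsistent)
```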